• Probability & Statistics

Q. Why are Probability and Statistics important ?

Ans : 
  •    Probability (meaning chance) : It is a number that indicates how likely an event is to occur. 
      It is expressed as a number in the range from 0 to 1, or, using percentage notation, in the range from 0% to 100%.
  •    Statistics : Statistics is the practice of collecting, organizing, analyzing, and interpreting data.

Q. What is Random Variable ?

Ans : A random variable is a variable whose value is a numerical outcome of a random phenomenon. There are two kinds of random variable:

1. Discrete Random Variable  : takes values from a finite or countably infinite set. 
    A discrete random variable X has a countable number of possible values.
    Example: Let X represent the sum of two dice.

    To graph the probability distribution of a discrete random variable, we construct a probability histogram.


2. Continuous Random Variable : takes all values in a given interval of numbers.
  • The probability distribution of a continuous random variable is shown by a density curve.
  • The probability that a continuous random variable X is exactly equal to a number is zero.
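
For the discrete example above (X = sum of two dice), the probability distribution can be enumerated directly. The following is a minimal sketch that assumes two fair six-sided dice; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair six-sided dice
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# PMF: P(X = s) for each possible sum s
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}

for s, p in pmf.items():
    print(s, p, "#" * counts[s])  # crude text version of a probability histogram
```

The row of `#` characters plays the role of the probability histogram: the bar for a sum of 7 is the tallest.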

Q. What is Outlier?

Ans : An outlier is a data point that lies far away from most of the other values in a dataset. Note : The mean and variance can be distorted by a single outlier, so the median is often used as a more robust measure of center.
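
A small sketch of this effect, using made-up numbers (the value 500 is a hypothetical outlier):

```python
import statistics

values = [10, 12, 11, 13, 12]
with_outlier = values + [500]   # hypothetical data with one extreme value

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print("mean  :", mean_before, "->", mean_after)      # the mean moves a lot
print("median:", median_before, "->", median_after)  # the median stays put
```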

Q. What is the difference between Population & Sample ?

    Ans : 
            Population : The whole set of data
            Sample  : A smaller set of data drawn from the Population. A subset of the Population.

            Population mean is denoted by μ (mu).
            Sample mean is denoted by x̄ (x bar).
    Note : As the sample size increases, the sample mean converges to the population mean.
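
The note above can be checked with a quick simulation. This is only a sketch: the population here is hypothetical (100,000 draws from an exponential distribution with mean about 2), and the sample sizes are arbitrary.

```python
import random
import statistics

random.seed(0)
# Hypothetical population: 100,000 values from an exponential distribution (mean about 2)
population = [random.expovariate(0.5) for _ in range(100_000)]
mu = statistics.fmean(population)

errors = {}
for n in (10, 1_000, 50_000):
    sample = random.sample(population, n)            # sample without replacement
    errors[n] = abs(statistics.fmean(sample) - mu)   # gap between x̄ and μ
    print(n, round(errors[n], 4))
```

For a single run the gap does not have to shrink at every step, but for large n it is typically very small.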


  • Gaussian (Normal) Distribution and its PDF (Probability Density Function) :

  • The Gaussian distribution is a continuous probability distribution for a real-valued random variable.

  • The mean of the distribution determines the location of the center of the graph, and the standard deviation determines the height and width of the graph. The total area under the normal curve is equal to 1.


Q. Why is the Normal distribution so important ?

Ans : Because many naturally occurring phenomena tend to approximate the normal distribution.
     Distributions are also simple models of natural behaviour. Using the properties of a distribution, we can draw many conclusions about the data.
  • Important Points :
  1. Normal distributions are symmetrical, but not all symmetrical distributions are normal.

  2. All normal distributions can be described by just two parameters: the mean and the standard deviation.

  3. mean= median =mode

  4. Variance is the measure of spread. For the Standard Normal Distribution, Variance = 1 and Mean = 0

  5. Kurtosis measures the thickness of the tails of a distribution relative to the tails of a normal distribution. The normal distribution has a kurtosis equal to 3.

  6. Skewness measures the degree of asymmetry of a distribution. The normal distribution is symmetric and has a skewness of zero

  • leptokurtic = high kurtosis, greater than 3.0 (fat tails)
  • platykurtic = low kurtosis, less than 3.0 (thin tails)
  7. The normal distribution is the limiting distribution of sample means in the Central Limit Theorem.

Q. What is the Central Limit Theorem ?

Ans : The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normal. 
This holds regardless of whether the source population is normal or skewed, provided the population has a finite mean and variance and the sample size is sufficiently large (n > 30 is a common rule of thumb).
  • CDF (Cumulative Distribution Function) of the Gaussian (Normal) distribution :

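  • The CDF always lies between 0 and 1.

  • The CDF of the Normal Distribution is an 'S' shaped curve.

  • The Normal Distribution follows the 68–95–99.7 rule: about 68%, 95% and 99.7% of the values lie within 1, 2 and 3 standard deviations of the mean.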
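
The 68–95–99.7 rule can be verified from the normal CDF. A minimal sketch using `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)   # standard normal distribution

# P(-k < Z < k) = CDF(k) - CDF(-k)
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```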


  • Symmetric distribution, Skewness and Kurtosis :

Q. What is Skewness ?

Ans : Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
  •    The skewness value can be positive, zero, negative, or undefined.
  • There are two types of Skewness :-

  1. Negative Skewness : The left tail is longer .

                     A left-skewed distribution usually appears as a right-leaning curve.
    
                    The mass of the distribution is concentrated on the right of the figure.
  2. Positive Skewness : The Right tail is longer .

                      A right-skewed distribution usually appears as a left-leaning curve.
    
                      The mass of the distribution is concentrated on the left of the figure.


Q. What is the relation of Mean, Median & Mode with Skewness ?

Ans : Skewness does not strictly determine the order of the mean and median, but for moderately skewed unimodal distributions there is an approximate empirical relation:
  • (Mode - Mean) ≈ 3*(Median - Mean)


Q. What is Kurtosis?

Ans : Kurtosis measures how heavy the tails of a distribution are. Kurtosis and Excess-Kurtosis are two different terms that are often confused.

    Excess_Kurtosis = Kurtosis - 3

Kurtosis of Normal distribution  = 3
Excess-Kurtosis of Normal distribution = 0
  • Platykurtic : excess-kurtosis < 0
  • Mesokurtic : excess-kurtosis = 0
  • Leptokurtic : excess-kurtosis > 0
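
Skewness and excess kurtosis can be computed from the central moments of the data. The sketch below assumes the population (biased) moment definitions; libraries may apply a bias correction by default, so their numbers can differ slightly. The function name is illustrative.

```python
def moments_summary(data):
    """Return (skewness, excess kurtosis) using population (biased) moment formulas."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return skewness, kurtosis - 3                 # Excess_Kurtosis = Kurtosis - 3

skew, excess = moments_summary([1, 2, 3, 4, 5])     # symmetric, flat data
right_skew, _ = moments_summary([1, 1, 1, 2, 10])   # long right tail
print(skew, excess, right_skew)
```

The symmetric data has zero skewness and negative excess kurtosis (platykurtic), while the second dataset has positive skewness.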


  • Summary :

  • Mean : tells about the location of distribution

  • Variance : tells about the spread of distribution

  • Skewness : tells how far the distribution departs from symmetry

  • Kurtosis : tells about the heaviness of the tails of the distribution (often loosely described as peakedness)


  • Standard normal variate (Z) and standardization :-

Q. What is Standard normal variate ?

Ans :  A standard normal variate is a normal variate with mean µ=0 and standard deviation σ =1 with a probability density function f(z).
       It is denoted by 'Z' .
  • Standard normal variables play a major role in regression analysis, the analysis of variance and time series analysis.


Q. What is standardization ?

Ans :  It is a transformation that converts a normal distribution into the Standard Normal Distribution.

        The same formula can be applied to any random variable with a finite mean & standard deviation. The result always has mean 0 and standard deviation 1, but it is a Standard Normal Variate only if X itself is normally distributed.

        Z = (X-µ)/σ

        where X = a random variable which follows a Normal Distribution
              Z = Standard Normal Variate
              µ = Mean of X
              σ = Standard Deviation of X
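
A minimal sketch of the formula Z = (X-µ)/σ applied to a small made-up dataset. It assumes the population standard deviation is used; the helper name `standardize` is hypothetical.

```python
import statistics

def standardize(values):
    """Return z-scores using the population mean and standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

z = standardize([2, 4, 4, 4, 5, 5, 7, 9])   # this data has mean 5 and standard deviation 2
print(z)
print(statistics.fmean(z), statistics.pstdev(z))   # mean 0 and standard deviation 1
```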
            


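Q. Why do we need Standardization ?

Ans : After standardization, variables measured on different scales become directly comparable, and probabilities can be read from the single Standard Normal Distribution.
  • Kernel density estimation using Gaussian Kernels :-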

Q. What is KDE ?

Ans : Kernel Density Estimation is an unsupervised, non-parametric technique for estimating the PDF of a random variable.
  • It is related to a histogram, but it smooths the data.
  • KDE converts a histogram-like view of the data into a smooth PDF (probability density function).
  • This technique allows us to create a smooth curve from a set of random data.
  • It can also be used to generate points that look like they came from a certain dataset - this behavior can power simple simulations, where simulated objects are modeled off of real data.

Q. Why KDE is required ?

Ans : Histograms are not smooth, and they depend on the width and the endpoints of the bins. KDE reduces this problem by providing smoother curves.

In a KDE plot, a small Gaussian kernel is centered on each data point (the mean of each kernel is the data point, and its spread, the standard deviation, is called the bandwidth). The KDE curve is the sum of these kernels, scaled so that the total area is 1.

Q. What should be the bandwidth of the Gaussian kernels ?
    Ans : 
         if the bandwidth is too small : the KDE curve will be jagged
         if the bandwidth is too large : the KDE curve will be overly flat.

    In seaborn, there are some good default heuristics for choosing the bandwidth
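
To make the role of the bandwidth concrete, here is a hand-written Gaussian KDE. It is a sketch for illustration, not how seaborn implements it; the data values and the three bandwidths are hypothetical.

```python
import math

def gaussian_kde(x, data, bandwidth):
    """Density estimate at x: average of Gaussian kernels centered on each data point."""
    total = 0.0
    for xi in data:
        u = (x - xi) / bandwidth
        total += math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return total / (len(data) * bandwidth)

data = [1.0, 1.2, 1.9, 3.5, 3.7]        # hypothetical observations
for h in (0.1, 0.5, 2.0):               # small, medium, large bandwidth
    curve = [gaussian_kde(x / 2, data, h) for x in range(0, 11)]   # grid 0.0 .. 5.0
    print(h, [round(v, 2) for v in curve])
```

With the small bandwidth the printed curve jumps between near-zero and tall spikes, while the large bandwidth gives values that barely change across the grid.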
  • Sampling Distribution & Central Limit Theorem :

Q. What is a Sampling Distribution ?

Ans : 

    The distribution of the sample means, computed over many samples of the same size, is called the Sampling Distribution of the sample mean.

Q. Are an infinite Mean and infinite Variance possible ? If yes, how can a distribution have infinite mean and variance ?

Ans : Yes. The best-known example is the Pareto Distribution, whose mean is infinite when its shape parameter α ≤ 1 and whose variance is infinite when α ≤ 2.

ref : https://stats.stackexchange.com/questions/91512/how-can-a-distribution-have-infinite-mean-and-variance
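
For the Pareto distribution with shape α and scale x_m, the mean is α·x_m/(α−1) when α > 1 and the variance is α·x_m²/((α−1)²(α−2)) when α > 2; otherwise they are infinite. The helper functions below are illustrative names, not library calls.

```python
import math

def pareto_mean(alpha, x_m=1.0):
    """Mean of a Pareto distribution with shape alpha and scale x_m; infinite when alpha <= 1."""
    return alpha * x_m / (alpha - 1) if alpha > 1 else math.inf

def pareto_variance(alpha, x_m=1.0):
    """Variance of a Pareto distribution; infinite when alpha <= 2."""
    if alpha <= 2:
        return math.inf
    return alpha * x_m ** 2 / ((alpha - 1) ** 2 * (alpha - 2))

for alpha in (0.8, 1.5, 3.0):
    print(alpha, pareto_mean(alpha), pareto_variance(alpha))
```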

Q. Can you state the Central Limit Theorem in one sentence ?

Ans : Yes;
    CLT states that :-
        "For large samples, the Sampling Distribution of the Sample Mean is approximately a Normal Distribution".      

Q. What does the term "Sampling Distribution of Sample Mean" mean ?

Ans : "Sampling Distribution of Sample Mean"  = Sampling  + Distribution of Sample Mean.

    Sampling : The process of drawing samples from the population

    Distribution of Sample Mean : The distribution obtained by computing the mean of each sample and plotting those means

Q. Why is the Central Limit Theorem beautiful ?

Ans : Because by looking only at samples of data points, we can draw conclusions about the whole population.

    The field of statistics is based upon the fact that it is rarely feasible or practical to collect all of the data from an entire population. Instead, we can gather a subset of data from a population, and then use statistics for that sample to draw conclusions about the population.

    To estimate the population mean and population variance of any distribution, we only need to know that they are finite & well-defined. 
    We can estimate them with a simple sampling exercise: compute the "Sampling Distribution of Sample Means", which will be approximately Normal.
    Then we estimate the population Mean by the Mean of the Distribution of Sample Means, and
                     the population Variance by n times the Variance of the Distribution of Sample Means (since σx̅ = σ/√n).
  • by observation : when Sample Size (n) > 30, the CLT approximation usually works well for distributions with finite mean and variance
  • ref : https://www.minitab.com/content/dam/www/en/uploadedfiles/content/academic/CentralLimitTheorem.pdf
  • ref : https://stats.stackexchange.com/questions/541379/testing-the-central-limit-theorem-with-the-shapiro-wilk-test-on-dice-rolling-sim


Q. Is the Central Limit Theorem applicable to every distribution ?

Ans : No; the Central Limit Theorem doesn't work with every distribution. This is due to one sneaky fact: sample means are clustered around the mean of the underlying distribution, if it exists. But how can a distribution have no mean? Well, one common distribution that has no finite mean in special cases is the Pareto distribution.
        
  • Note : CLT is valid only for distributions which have a finite Mean & finite Variance
  • Research Area : Limit theorems for distributions which have infinite Mean & Variance

Q. What is Central Limit Theorem ? Define Formally.

Ans : 

    The formal definition of central limit theorem states that:

    " For a population with mean (µ) and standard deviation (σ), if we take sufficiently large random samples from the population with replacement, then the distribution of the sample means (also known as sampling distribution of means) will approximate to the normal distribution. 
    This will hold true regardless of the distribution of the source population whether it is normal, skewed, uniform or completely random, provided the sample size (n) is sufficiently large (typically n > 30). When the population is normally distributed, the theorem holds true even for smaller sample size i.e. n < 30. "
    


  • Properties of CLT :
  1. μx̅ ~ μ , Mean of sample means(μx̅) will be approximately equal to the population mean (μ) .

  2. σx̅ = σ/√n , As we increase the sample size (n), the standard deviation of the sampling distribution of means (σx̅ = σ/√n) will become smaller.

    where,

    population mean : μ

    population standard deviation : σ

    sample mean : x̅i

    mean of all the sample means (x̅1, x̅2, x̅3,…, and x̅m) : μx̅

    standard deviation of all 'm' the sample means (x̅1, x̅2, x̅3,…, and x̅m) : σx̅

    size of each sample : n
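
The two properties above can be checked by simulation. This sketch assumes a skewed source population (exponential with mean 2 and standard deviation 2) and arbitrary choices of n = 50 and m = 5,000.

```python
import math
import random
import statistics

random.seed(1)
n, m = 50, 5_000            # size of each sample, number of samples
mu = sigma = 2.0            # exponential with rate 0.5 has mean 2 and standard deviation 2

# Mean of each of the m samples (x̅1, x̅2, ..., x̅m)
sample_means = [
    statistics.fmean([random.expovariate(0.5) for _ in range(n)])
    for _ in range(m)
]

mean_of_means = statistics.fmean(sample_means)   # μx̅, should be close to μ
std_of_means = statistics.pstdev(sample_means)   # σx̅, should be close to σ/√n
print(mean_of_means, mu)
print(std_of_means, sigma / math.sqrt(n))
```

Multiplying the variance of the sample means by n gives back an estimate of the population variance.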

Q. How to test if a random variable is normally distributed or not ?

Ans : There are statistical tests (such as the Kolmogorov-Smirnov (KS) test and the more powerful Anderson-Darling (AD) test) to check whether a random variable is normally distributed, but the Quantile-Quantile plot (Q-Q plot) is the simplest graphical method.

Q. How to plot a Q-Q plot ? How to read a Q-Q plot ?

Ans : 
     Random Variable X = x1, x2, x3 , ... , x500
    
Step1 : Sort the data points and compute their percentiles

Step2 : Compute the same percentiles of a Standard Normal random variable Y (called the theoretical quantiles)

Step3  : Plot the theoretical quantiles on the X-axis vs the sorted values of X on the Y-axis

  • Note : if all the points lie on a straight line, then X and Y have the same type of distribution. Since Y is Normally distributed, we can say that X is also Normally distributed.
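
The three steps can also be written out by hand without a plotting library. This is a sketch under assumptions: the percentile of the i-th sorted point is taken as (i + 0.5)/n, which is one common convention among several, and the straightness of the plot is summarized by a correlation coefficient instead of a picture.

```python
import random
from statistics import NormalDist

def qq_points(data):
    """Return (theoretical quantile, observed value) pairs against N(0, 1)."""
    ordered = sorted(data)                              # Step 1
    n = len(ordered)
    std_normal = NormalDist()
    # Step 2: percentile of the i-th sorted point taken as (i + 0.5) / n
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))              # Step 3: points to plot

def correlation(points):
    """Pearson correlation of the plotted points (1.0 means a perfect straight line)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / (sxx * syy) ** 0.5

random.seed(2)
normal_data = [random.gauss(20, 5) for _ in range(500)]
skewed_data = [random.expovariate(1.0) for _ in range(500)]
r_normal = correlation(qq_points(normal_data))
r_skewed = correlation(qq_points(skewed_data))
print(r_normal, r_skewed)
```

The normal sample gives points that are almost perfectly on a line, while the skewed sample bends away from it.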
In [ ]:
import numpy as np
import pylab 
import scipy.stats as stats


# N(0,1): generate 1000 random observations from a Normal distribution; loc=0 is the mean, scale=1 is the standard deviation
std_normal = np.random.normal(loc=0, scale=1, size=1000)

# 0th to 100th percentiles of std_normal
for i in range(0,101): 
    print(i,"th percentile = ",np.percentile(std_normal,i))
0 th percentile =  -3.499107240994038
1 th percentile =  -2.427182009233492
2 th percentile =  -2.070223435950522
3 th percentile =  -1.9625915961504172
4 th percentile =  -1.8209462403785646
5 th percentile =  -1.743883213710714
6 th percentile =  -1.6149053963860063
7 th percentile =  -1.5463815055112005
8 th percentile =  -1.481084317051865
9 th percentile =  -1.4069825089944539
10 th percentile =  -1.3322396618204373
11 th percentile =  -1.272714229737105
12 th percentile =  -1.1860768190894677
13 th percentile =  -1.1477211277530386
14 th percentile =  -1.1110624123209727
15 th percentile =  -1.0798864897965272
16 th percentile =  -1.0219101521089489
17 th percentile =  -0.9709299558307669
18 th percentile =  -0.9303770995467152
19 th percentile =  -0.9045841900330435
20 th percentile =  -0.8562534443322407
21 th percentile =  -0.8261124535907577
22 th percentile =  -0.7654617694760109
23 th percentile =  -0.7376966275274653
24 th percentile =  -0.7052229899922399
25 th percentile =  -0.676003680087186
26 th percentile =  -0.6372133640732198
27 th percentile =  -0.5881785838927646
28 th percentile =  -0.5543907999277767
29 th percentile =  -0.5188507209402796
30 th percentile =  -0.4857789676176328
31 th percentile =  -0.44409578639599284
32 th percentile =  -0.4205507764020476
33 th percentile =  -0.3947487620394244
34 th percentile =  -0.3674692731151177
35 th percentile =  -0.34559662140631875
36 th percentile =  -0.32787030042527915
37 th percentile =  -0.3002173974576282
38 th percentile =  -0.2832327122015649
39 th percentile =  -0.25690842460934715
40 th percentile =  -0.22226193981019468
41 th percentile =  -0.1898449388933453
42 th percentile =  -0.16322030105929394
43 th percentile =  -0.12634520458955492
44 th percentile =  -0.08241532746494765
45 th percentile =  -0.06198198123758298
46 th percentile =  -0.033097400036885286
47 th percentile =  -0.019720246741775926
48 th percentile =  -0.001753052277221307
49 th percentile =  0.026216469747475723
50 th percentile =  0.04494442974715856
51 th percentile =  0.06016656165354633
52 th percentile =  0.08854577182103474
53 th percentile =  0.11863109361413131
54 th percentile =  0.13893193534723275
55 th percentile =  0.1676792880252538
56 th percentile =  0.19501902424641532
57 th percentile =  0.21008010413821349
58 th percentile =  0.22920822468730945
59 th percentile =  0.25296205335820665
60 th percentile =  0.27352446745698034
61 th percentile =  0.3120924784175149
62 th percentile =  0.3236996010028199
63 th percentile =  0.33746884025595736
64 th percentile =  0.3675352233038593
65 th percentile =  0.380599262095988
66 th percentile =  0.4179182525790814
67 th percentile =  0.45735012309443296
68 th percentile =  0.48354560673864694
69 th percentile =  0.5334410575367026
70 th percentile =  0.5768445460856778
71 th percentile =  0.6037820090453173
72 th percentile =  0.6271377017496592
73 th percentile =  0.6497411037527893
74 th percentile =  0.6724994332800354
75 th percentile =  0.7175587082151655
76 th percentile =  0.7500539789886305
77 th percentile =  0.7769110464192663
78 th percentile =  0.8144615373916797
79 th percentile =  0.8563242439676142
80 th percentile =  0.8879321065945307
81 th percentile =  0.9185637239401587
82 th percentile =  0.9544595969665366
83 th percentile =  0.9802688928648484
84 th percentile =  1.0142445759676963
85 th percentile =  1.0524550503578058
86 th percentile =  1.1061472462394062
87 th percentile =  1.1559903749964504
88 th percentile =  1.2263073763598396
89 th percentile =  1.3171644194790868
90 th percentile =  1.3511737703263567
91 th percentile =  1.397394030176791
92 th percentile =  1.4755717075799832
93 th percentile =  1.554195186098769
94 th percentile =  1.5958476925024399
95 th percentile =  1.68312169911816
96 th percentile =  1.820991217729968
97 th percentile =  1.9557747351016819
98 th percentile =  2.232644951437211
99 th percentile =  2.439658572200899
100 th percentile =  3.403573917938617
In [ ]:
# generate 100 samples from a Normal distribution with mean 20 and standard deviation 5
measurement = np.random.normal(loc=20, scale=5, size=100)
# try size = 50, 1000: as the sample size increases, more and more points fall on the straight line

# Limitation of the Q-Q plot: it is hard to draw any conclusion when the sample size is small

# Q-Q plot: compare with the standard normal variable
stats.probplot(measurement, dist="norm", plot=pylab)
pylab.show()

# Since the points lie close to a straight line, both have the same type of distribution.
In [ ]:
# Generate 100 samples from a uniform distribution
measurement = np.random.uniform(low=-1, high=1, size=100)
# try size = 50, 1000
stats.probplot(measurement, dist="norm", plot=pylab)
pylab.show()

# Here we observe that, with few data points, it is difficult to tell 
# whether the variable is normal or not.

-Advantages of the Q-Q Plot :

  1. To check whether a random variable is Normally distributed or not
  2. To check whether random variables X and Y have the same type of distribution

Q. How/Where/When to use Distributions in the Real World ?

Ans : All these Probability & Statistical tools are used in Exploratory Data Analysis.

  • Data Analysis is nothing but answering questions about data.

  • All these distributions help answer Exploratory Data Analysis questions about data.

    The Gaussian distribution is a theoretical model of the distribution of data that is observed in many natural phenomena. We can use it to get insights easily.

Chebyshev’s inequality : (It is valid for any distribution)

If X is any random variable with finite mean μ and non-zero finite standard deviation σ, then for any k > 0

    P(|X - μ| ≥ kσ) ≤ 1/k²

Note : In other words, use the below form of this formula

    P(μ - kσ < X < μ + kσ) ≥ 1 - 1/k²
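
A small sketch of Chebyshev’s inequality in the form P(μ − kσ < X < μ + kσ) ≥ 1 − 1/k², checked on a deliberately non-normal sample. The sample (exponential data) and the function name are assumptions made for illustration.

```python
import random
import statistics

def chebyshev_lower_bound(k):
    """Minimum probability of falling within k standard deviations of the mean."""
    return 1 - 1 / k ** 2

random.seed(3)
data = [random.expovariate(1.0) for _ in range(10_000)]   # a clearly non-normal sample
mu, sigma = statistics.fmean(data), statistics.pstdev(data)

inside_fraction = {}
for k in (2, 3):
    # Observed fraction of points within k standard deviations of the mean
    inside_fraction[k] = sum(1 for x in data if abs(x - mu) < k * sigma) / len(data)
    print(k, chebyshev_lower_bound(k), inside_fraction[k])
```

The observed fractions are well above the bounds: Chebyshev gives a guarantee for any distribution, so it is usually loose for a specific one.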

  • Discrete and Continuous Uniform distributions :

1. Discrete Uniform Distribution :

 If a random variable is discrete and it follows a Uniform distribution, then it is called a Discrete Uniform Distribution.

- PMF : The Probability Mass Function is used for a discrete random variable

  • In a Uniform Distribution all the events are equi-probable.
  • It is a symmetric distribution
  • It is not skewed, so its skewness is 0
  • Ex : throwing a fair die, tossing a fair coin

2. Continuous Uniform Distribution :

 If a random variable is continuous and it follows a Uniform distribution, then it is called a Continuous Uniform Distribution.

- PDF : The Probability Density Function is used for a continuous random variable

In [ ]:
# Example : random.random() generates a random number in [0, 1) from the Uniform Distribution
import random
print(random.random())
0.8029317742016568

Q. How to randomly sample data points?

Ans : Using the Uniform Distribution, we generate random numbers that all have equal probability

- Most random number generators follow the Uniform Distribution

In [ ]:
# load IRIS dataset with 150 points
from sklearn import datasets
iris = datasets.load_iris()
d = iris.data
d.shape
Out[ ]:
(150, 4)
In [ ]:
# Sample roughly 30 data points randomly from the 150-point dataset

n = 150
m = 30
p = m / n  # probability of keeping each point

sample_data = []

for i in range(0, n):
    if random.random() <= p:
        sample_data.append(d[i, :])  # keep row i (all 4 features)
        
len(sample_data) # size of random sample is roughly 30, not exactly 30, Try out
Out[ ]:
29
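
The loop above keeps each point with probability m/n, so the sample size varies from run to run. If exactly m points are needed, one alternative (a different technique from the loop above, shown only as a sketch) is to draw m distinct indices with `random.sample`:

```python
import random

n, m = 150, 30
random.seed(4)
indices = random.sample(range(n), m)   # m distinct row indices, each subset equally likely
# With the iris array loaded earlier, the sampled rows would be d[indices]
print(len(indices), sorted(indices)[:5])
```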
  • Bernoulli Distribution :

  • It is discrete
  • It is used when we have only two outcomes. Ex:- tossing a coin


  • Binomial Distribution :

  • It is discrete
  • It counts the number of successes in n independent Bernoulli trials with the same success probability
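
The Binomial PMF is P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), and the Bernoulli distribution is the special case n = 1. A minimal sketch, with a hypothetical helper name, for 4 tosses of a fair coin:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

pmf = [binomial_pmf(k, 4, 0.5) for k in range(5)]   # 4 fair coin tosses
print(pmf)
print(binomial_pmf(1, 1, 0.5))                      # n = 1 is the Bernoulli case
```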

  • Log-Normal Distribution :

  • A Random Variable X is said to follow a Log-Normal Distribution if log(X) follows a Normal Distribution.

  • Also called the Galton distribution

  • Continuous Probability Distribution

  • Not Symmetric

  • Just like the Normal Distribution, it has two parameters : the Mean & Variance of log(X)

Example of Log-Normal Distribution in Real Life :

  1. The length of comments posted in Internet discussion forums follows a log-normal distribution. (Most of the comments are short; only some of the comments are long.)

  2. Measures of size of living tissue (length, skin area, weight).

  3. In economics, there is evidence that the income of 97%–99% of the population is distributed log-normally. (The distribution of higher-income individuals follows a Pareto distribution).

  4. In scientometrics, the number of citations to journal articles and patents follows a discrete log-normal distribution

  • Fun Part 😀 : Love ❤ Relationships💑 follow Log-Normal Distribution.

Q. How to know whether a Random Variable follows a Log-Normal Distribution ?

Ans : Take the log of all data points and then draw a Q-Q plot against the Normal distribution
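
Before drawing the Q-Q plot, the log step itself can be sketched numerically. The data below is simulated (log(X) is assumed to be Normal with mean 1.0 and standard deviation 0.5), so this only illustrates the idea; it is not a test for real data.

```python
import math
import random
import statistics

random.seed(5)
# Hypothetical log-normal data: log(X) ~ Normal(mean=1.0, std=0.5)
x = [random.lognormvariate(1.0, 0.5) for _ in range(20_000)]
log_x = [math.log(v) for v in x]

# After taking logs, the parameters of the underlying Normal are recovered
log_mean, log_std = statistics.fmean(log_x), statistics.pstdev(log_x)
# The raw data is right-skewed: its mean is pulled above its median
raw_mean, raw_median = statistics.fmean(x), statistics.median(x)
print(log_mean, log_std)
print(raw_mean, raw_median)
```

The values in `log_x` are what would be passed to the Q-Q plot against the Normal distribution.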

Power Law :

  • Occurs a lot in nature
  • A probability distribution that follows a Power Law is called a Pareto Distribution

Pareto distribution :

  • Continuous distribution

Example of Pareto Distribution in Real Life :

  1. Sizes of sand particles

  2. The length distribution in jobs assigned to supercomputers (a few large ones, many small ones)

  3. The sizes of human settlements (few cities, many hamlets/villages)

  4. File size distribution of Internet traffic which uses the TCP protocol (many smaller files, few larger ones)

Q. How to know whether something follows a Power Law or not ?

Ans : Instead of plotting X vs Y, plot log(X) vs log(Y). If you get a straight line, the data likely follows a Power Law. This works most of the time, not always.
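
The log-log check works because y = c·x^a becomes log(y) = log(c) + a·log(x), a straight line with slope a. A minimal sketch with made-up, noise-free data (the function name is illustrative):

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope and intercept of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    return slope, my - slope * mx

xs = [1, 2, 4, 8, 16, 32]
ys = [2 * x ** -1.5 for x in xs]     # hypothetical exact power law y = 2 * x^(-1.5)
slope, intercept = loglog_slope(xs, ys)
print(slope, math.exp(intercept))    # recovers the exponent and the constant
```

With real data the points will scatter around the line, so the fitted slope is only an estimate of the exponent.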

Q. How to know whether a random variable follows a Pareto Distribution ?

Ans : By using a Q-Q plot [one axis is the Pareto distribution and the other is your observations]
  • Box-Cox Transformation :

We have seen that, given a random variable X which is log-normally distributed, we can easily convert it into a Gaussian distributed random variable Y by taking the natural log of all data points.

  • In Machine Learning, most of the time we either assume that a random variable is Gaussian or try to convert it into a Gaussian random variable, because we can derive many insights from it. The Box-Cox transformation is a power transform used for this purpose.
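
The usual definition of the Box-Cox transform for positive x is y = (x^λ − 1)/λ when λ ≠ 0 and y = log(x) when λ = 0, so the log transform above is the special case λ = 0. The sketch below implements only this formula with a hand-picked λ; in practice λ is estimated from the data (for example, `scipy.stats.boxcox` can do this), which is not shown here.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam (lambda)."""
    if x <= 0:
        raise ValueError("Box-Cox is defined only for positive values")
    if lam == 0:
        return math.log(x)               # lambda = 0 reduces to the natural log
    return (x ** lam - 1) / lam

print(box_cox(math.e, 0))    # log case
print(box_cox(4.0, 0.5))     # square-root-like case
print(box_cox(3.0, 1.0))     # lambda = 1 only shifts the data
```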

_____________________ See Part 2 _____________________________________________
